library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(corrplot)
# read file
social <- read.csv(url("https://raw.githubusercontent.com/jgscott/STA380/master/data/social_marketing.csv"), row.names = 1)
#drop columns spam, adult, and uncategorized from file
social = subset(social, select = -c(spam,adult,uncategorized))
corrplot(cor(social), type = "lower")
From the correlation plot, we can identidy several variables with high correlation. For example: personal_fitness and health_nutrition
knitr::include_graphics("https://github.com/tiffblahthegiraffe/STA380_HW/blob/master/plot.png")
With excel, we looked further into the correlations between variables, and decided to manually group several highly correlated variables together into six hypothetical groups/clusters, as displayed below:
knitr::include_graphics("https://github.com/tiffblahthegiraffe/STA380_HW/blob/master/group.png")
Let’s dig into each group a little more:
Group 1 is influencer, whom can be seem as the social media savvy millennials that like sharing their lifestyle online
Group 2 represents businessman, a group of corporate people that enjoy politics, news, and business Group 3 are artists that talks about art, music and films on social media
Group 4 captures familyguy. These people are very family oriented, with interests including religion, family, and parenting
Group 5 is “dude.” Think about a college male that likes online gaming and sports.
Group 6 represents “fit.” These people care about their fitness, shape, nutrition consumption, and hope to stay active
Now, with six hypothetical groups, we want to use clustering to see if our correlation-based groups serve as a group proxy as NutrientH20’s market segmentation
#create hypothetical groups in dataframe
social["influencer"] = social$chatter+social$photo_sharing+social$shopping+social$current_events+social$dating + social$cooking+social$beauty+social$fashion
social["businessman"] = social$travel + social$politics +social$computers +social$news + social$automotive +social$business
social['artists'] = social$tv_film +social$art +social$music +social$crafts +social$small_business
social['familyguy'] = social$sports_fandom + social$religion +social$parenting +social$school +social$food +social$family +social$home_and_garden
social['dude'] = social$online_gaming +social$college_uni +social$sports_playing
social['fit'] =social$outdoors+social$health_nutrition+social$personal_fitness+social$eco
#select six groups and form new dataframe
social_new = social[, c(34:39)]
#normalize social_new for better comparison between rows
social_norm = social_new/rowSums(social_new)
#scale social_norm
social_scaled <- scale(social_norm, center=TRUE, scale=TRUE)
#select K
set.seed(123)
fviz_nbclust(social_scaled, kmeans, method = "wss")
From elbow method, it is concluded that k=6 gives us best clustering result
set.seed(123)
fviz_nbclust(social_scaled, kmeans, method = "silhouette")
## Warning: did not converge in 10 iterations
Again, K=6 returns best clusting result.
To triple check, we calculate CH
set.seed(123)
for (i in 2:10){
final <- kmeans(social_scaled, centers = i , nstart = 25)
B = final$betweenss
final$withinss
W = final$tot.withinss
B/W
n = nrow(social_scaled)
k=i
CH = (B/(k-1))/(W/(n-k))
cat("k=", k, ", CH:", CH, "\n")
}
## k= 2 , CH: 1652.997
## k= 3 , CH: 1907.703
## k= 4 , CH: 2224.62
## k= 5 , CH: 2747.85
## k= 6 , CH: 3358.518
## k= 7 , CH: 3126.006
## k= 8 , CH: 2900.054
## k= 9 , CH: 2730.181
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
## k= 10 , CH: 2609.697
When K=6, CH reches its max of 3358.518
From the methods above, we get a preliminary idea that our six manually selected groups might be a good proxy for market segmentation
set.seed(123)
#use K=6 to run kmeans
final <- kmeans(social_scaled, centers =6 , nstart = 25)
set.seed(123)
#display centers of seven clusters to see how they are allocated
print(final$centers)
## influencer businessman artists familyguy dude fit
## 1 -0.3958139 -0.1868781 2.2283920 -0.2565413 0.02980669 -0.3856151
## 2 -0.5158493 -0.4328407 -0.3624055 -0.3525595 -0.31408302 1.8864271
## 3 -0.6836458 -0.4425627 -0.1639833 -0.4226420 2.83906155 -0.4246151
## 4 -0.6435441 -0.4117771 -0.2058447 2.0067311 -0.30223302 -0.3872227
## 5 -0.7161993 1.9344438 -0.3028690 -0.1884180 -0.28954096 -0.4019451
## 6 1.1261119 -0.3096098 -0.2503628 -0.4000180 -0.25278227 -0.3889869
From the result, we can see that our hypothetical grouping method works well!
Each of the six groups represents a distinct demographics.
Cluster 1 represents “artists.” Variables include: tv_film, art, music, crafts, small_business
Cluster 2 is “fit.” These people are the outdoor enthuiast that care about health_nutrition, outdoors, personal_fitness, and eco.
Cluster 3 centers on “dude,” a proxy for people who enjoy topics like online_gaming, college_uni, sports_playing.
Cluster 4 centers on “familyguy.” This category captures people who are “family oriented” and enojoy taking about topics like sports_fandom, religion, parenting, school, food, family, and home_and_garden.
Cluster 5 centers heavily on the businessman group, which includes people who like talking about topics like travel, travel, politics, computers, news, automotive, business.
Cluster 6 captures “influencers,” which can be thought as the millennials that like sharing lifestyle related topics on social media. Topics include:photo_sharing, shopping, current_events, dating, cooking, beauty, fashion.
#lets see how many people belong in each group with this line of code:
final$size
## [1] 841 1362 627 1133 1205 2714
cluster 1 (artists) has 841 people
cluster 2 (fit) has 1362 people
cluster 3 (dude) represents 627 people
cluster 4 (familyguy) includes 1133 people
cluster 5 (businessman) has 1205 people
cluster 6 (influencer) includes 2714 people
#kmeans plot to show 6 clusters
fviz_cluster(final, data = social_scaled, ellipse.type = "norm", stand = TRUE, geom = "point")
The plot shows that our six groups are separated in a fairly clear fashion.
Now, let us move on to PCA to further analyze our hypothetical groups.
#Run PCA with 5 ranks
pc1 = prcomp(social_norm, scale=TRUE, rank=5)
## Warning: In prcomp.default(social_norm, scale = TRUE, rank = 5) :
## extra argument 'rank' will be disregarded
loadings = pc1$rotation
scores = pc1$x
#several biplots show the first two PCs and how these groups are segmented
q1 = qplot(scores[,1], scores[,2], color= social_norm$influencer , xlab='Component 1', ylab='Component 2')
q2 = qplot(scores[,1], scores[,2], color = social_norm$businessman, xlab='Component 1', ylab='Component 2')
q3 = qplot(scores[,1], scores[,2], color = social_norm$artists, xlab='Component 1', ylab='Component 2')
q4 = qplot(scores[,1], scores[,2], color = social_norm$familyguy, xlab='Component 1', ylab='Component 2')
q5 = qplot(scores[,1], scores[,2], color = social_norm$dude, xlab='Component 1', ylab='Component 2')
q6 = qplot(scores[,1], scores[,2], color = social_norm$fit, xlab='Component 1', ylab='Component 2')
These plots showcase where each hypothetical group belong in the PCA two-domention result. A plot represent a twitter user, and the more red it is, the the more percentage of the user’s posts relate to the corresponding group.
In other words, the more red it is, the more the user belongs to a hypothetical group we created.
The influencer
#influencer
q1+scale_color_gradient(low="ivory", high="red")
This plot shows where influencer sits in this dimension.
The businessman
#Businessman
q2+scale_color_gradient(low="ivory", high="red")
This plot shows businessman (red) versus others (white).
The artist
#Artists
q3+scale_color_gradient(low="ivory", high="red")
This plot points out those where are artists are.
The familyguy
#Familyguy
q4+scale_color_gradient(low="ivory", high="red")
This plot identifies the familyguy group.
The dude
#Dude
q5+scale_color_gradient(low="ivory", high="red")
This plot shows where “dude” sit related to others
The fit
#Fit
q6+scale_color_gradient(low="ivory", high="red")
This plot clearly identifies the “fit” group
**Interpretation: component1 does a great job separating influencer (right) from familyguy and businessman (left of graph) Component 2 is great at separating “fit” (upper) from “artists” and “dude” (lower)
o1 = order(loadings[,1], decreasing=TRUE)
colnames(social_norm)[head(o1,2)]
## [1] "influencer" "fit"
from the formula below, it is clear that PC1, 2 has the ability to separate influencer and fit out from the rest of the data; result aligns with plots
loadings
## PC1 PC2 PC3 PC4 PC5
## influencer 0.78837083 -0.1120214 0.18897859 -0.18558798 -0.04929072
## businessman -0.35333354 -0.2033397 0.63810200 0.50052420 -0.02318891
## artists -0.12683750 -0.4489010 -0.30143877 -0.05915333 0.79131622
## familyguy -0.47330634 0.1299075 0.08223908 -0.74344082 -0.15779690
## dude -0.11307903 -0.3699373 -0.61327767 0.24497233 -0.54891098
## fit -0.02708018 0.7686758 -0.28874700 0.31435302 0.21130124
## PC6
## influencer 0.5416062
## businessman 0.4190132
## artists 0.2487053
## familyguy 0.4180165
## dude 0.3360585
## fit 0.4261025
Looking at the vectors, we attain the same results.
From kmeans clustering and PCA, we can conclude that our hypothecial grouping method works very well in identifying users with different interests on social media, or “socialgraphics.” This output can help NutrientH20 better target its audience and focus their social media marketing efforts on a more defined and targeted group of people.
Once again, the groups, variables included, and number of peolpe in each group:
knitr::include_graphics("https://github.com/tiffblahthegiraffe/STA380_HW/blob/master/final.png")